Fixed writing nested/sliced arrays to parquet #1326
Codecov Report
Base: 83.12% // Head: 83.12% // Increases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## main #1326 +/- ##
=======================================
Coverage 83.12% 83.12%
=======================================
Files 370 370
Lines 40169 40241 +72
=======================================
+ Hits 33391 33451 +60
- Misses 6778 6790 +12
@jorgecarleitao I added a test. I am not really sure what the proper statistics should be in such a case. The values are correct; only the nesting differs. Pyarrow does not nest the null count, whilst we do.
Oof, thank you so much, @ritchie46!
@jorgecarleitao This should be good to go. The coverage failure is unrelated.
fixes #1323
We wrote invalid arrays to parquet because we did not take the offset of the list arrays into account. This got worse now that we slice leaf arrays when writing pages, so the offsets no longer correspond to the pages being written.
Because reading and writing parquet uses Dremel rather than offsets, these files could still be read, but they contained far too many values.
This PR fixes the huge memory usage, the huge files, and the incorrect files observed in #1323.
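
To illustrate the class of bug described above, here is a minimal sketch. It does not use arrow2's actual types: `ListSlice`, `leaf_values_ignoring_offsets`, `leaf_values` and `example` are hypothetical names, and the list array is reduced to an offsets window over a shared child buffer. It only shows why ignoring the offsets of a sliced list array emits far more leaf values than the slice contains.

```rust
/// Hypothetical, simplified stand-in for a list array (not arrow2's type):
/// `offsets` has one entry more than the number of lists and indexes into
/// the shared child buffer `values`.
struct ListSlice<'a> {
    offsets: &'a [usize],
    values: &'a [i32],
}

impl<'a> ListSlice<'a> {
    /// Zero-copy slice: only the offsets window is narrowed,
    /// the child buffer is left untouched.
    fn slice(&self, start: usize, len: usize) -> ListSlice<'a> {
        let offsets: &'a [usize] = self.offsets;
        ListSlice {
            offsets: &offsets[start..=start + len],
            values: self.values,
        }
    }

    /// Buggy variant: ignores the offsets and returns the whole child buffer.
    fn leaf_values_ignoring_offsets(&self) -> &'a [i32] {
        self.values
    }

    /// Offset-aware variant: returns only the child range that the
    /// (possibly sliced) offsets actually refer to.
    fn leaf_values(&self) -> &'a [i32] {
        let first = self.offsets[0];
        let last = self.offsets[self.offsets.len() - 1];
        let values: &'a [i32] = self.values;
        &values[first..last]
    }
}

/// Builds three lists, [1, 2], [3] and [4, 5, 6], keeps only the middle one,
/// and returns the leaf values produced by the buggy and the offset-aware variant.
fn example() -> (Vec<i32>, Vec<i32>) {
    let offsets = [0, 2, 3, 6];
    let values = [1, 2, 3, 4, 5, 6];
    let array = ListSlice { offsets: &offsets, values: &values };
    let sliced = array.slice(1, 1);
    (
        sliced.leaf_values_ignoring_offsets().to_vec(),
        sliced.leaf_values().to_vec(),
    )
}
```

In this sketch, calling `example()` yields all six child values for the buggy variant and only `[3]` for the offset-aware one, which mirrors the "way too many values" symptom on a small scale.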